import sys
from pathlib import Path

# Relative project root path
project_root = Path("../..")
src_path = project_root / "5-API" / "src"
dataset_path = project_root / "data" / "BIRD"

# Add src directory to Python path
sys.path.append(str(src_path.resolve()))
Approaches and Challenges in Annotating a Closed Domain NER Dataset
1 Introduction
1.1 Named Entity Recognition
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP) that involves identifying and classifying entities in text into predefined categories such as person names, organizations, locations, etc.
1.2 Closed Domain NER
Closed Domain Named Entity Recognition (CD-NER) involves extracting entities from text that correspond to elements of a structured database, such as table names, column names, or partial tuple values. This domain-specific set can contain billions of entities, making extraction a significant challenge. The primary difficulty lies in accurately identifying entities within this closed set while managing the complexities of database size and specificity. CD-NER requires handling specialized vocabulary, leveraging domain-specific context, and dealing with a large fixed pool of entities.
1.3 Benchmarks
In the field of text-to-SQL translation, benchmark datasets like BIRD and Spider have advanced research and established baselines. However, the lack of high-quality CD-NER benchmark datasets limits progress in this area. This article addresses this gap by converting text-to-SQL benchmarks into CD-NER benchmarks. By leveraging structured features from text-to-SQL datasets, we aim to provide a reliable evaluation resource for closed-domain entity extraction.
2 BIRD Dataset
We’ll be working with the BIRD dataset, which contains natural language questions paired with SQL queries.
from cdner.datasets import BirdDataset
from cdner.annotators.pglast_annotator import PglastAnnotator

# Initialize the dataset
dataset = BirdDataset(root=dataset_path, train=True).load()
examples_list = list(dataset.examples)
print(f"Number of samples in the dataset: {len(examples_list)}")
Number of samples in the dataset: 9428
from pprint import pprint

# Display the first sample
pprint(examples_list[0].model_dump())
{'db_id': 'movie_platform',
'id': 'bird:train.json:0',
'query': 'SELECT movie_title FROM movies WHERE movie_release_year = 1945 '
'ORDER BY movie_popularity DESC LIMIT 1',
'question': 'Name movie titles released in year 1945. Sort the listing by the '
'descending order of movie popularity.'}
3 Building CD-NER Benchmarks
To transform BIRD into a CD-NER benchmark, we map each sentence (question text) to its lexemes (SQL query entities).
This is broken down into the following steps:
3.1 Extracting Lexemes
We begin by parsing the SQL queries using pglast, a Python library that parses PostgreSQL SQL statements into an Abstract Syntax Tree (AST). This AST representation allows us to navigate the structure of the SQL queries and extract:
- Tables: Identified by navigating RangeVar nodes in the AST.
- Columns: Extracted from ColumnRef nodes.
- Values: Retrieved from A_Const nodes representing constants in the query.
This provides a candidate list that needs to be matched with the question text. We call the extracted entities lexemes; a minimal extraction sketch follows below.
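To make the step concrete, here is a minimal sketch of extracting lexemes with pglast's parse_sql and Visitor API. It is not the cdner PglastAnnotator implementation, and because the attribute holding a constant's raw value differs across pglast versions, the value handling below probes a few likely names rather than assuming one.

from pglast import parse_sql
from pglast.visitors import Visitor

class LexemeCollector(Visitor):
    """Collect (label_type, lexeme) pairs while walking the parse tree."""

    def __init__(self):
        super().__init__()
        self.lexemes = []

    def visit_RangeVar(self, ancestors, node):
        # Table references appear as RangeVar nodes.
        self.lexemes.append(("table", node.relname))

    def visit_ColumnRef(self, ancestors, node):
        # Column references appear as ColumnRef nodes; name parts live in `fields`.
        parts = [getattr(f, "sval", getattr(f, "val", None)) for f in node.fields]
        parts = [p for p in parts if isinstance(p, str)]
        if parts:
            self.lexemes.append(("column", ".".join(parts)))

    def visit_A_Const(self, ancestors, node):
        # Constants (numbers, strings) appear as A_Const nodes; the attribute
        # holding the raw value varies across pglast versions, so probe a few.
        val = getattr(node, "val", node)
        raw = next((getattr(val, a) for a in ("ival", "sval", "fval", "val")
                    if hasattr(val, a)), None)
        if raw is not None:
            self.lexemes.append(("value", str(raw)))

sql = ("SELECT movie_title FROM movies WHERE movie_release_year = 1945 "
       "ORDER BY movie_popularity DESC LIMIT 1")
collector = LexemeCollector()
collector(parse_sql(sql))
print(collector.lexemes)
# -> pairs such as ('column', 'movie_title'), ('table', 'movies'),
#    ('column', 'movie_release_year'), ('value', '1945'), ('column', 'movie_popularity')
#    (order depends on the traversal)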
3.2 Matching Lexemes
Once we have the list of lexemes, the next step is to match them with substrings in the corresponding natural language question or sentence. Direct string matching is often insufficient due to variations in phrasing, synonyms, or differences in tokenization. To address this, we use a convolutional search with fuzzy string matching (a sketch follows the list below):
- Tokenization: The question text is tokenized, preserving the position of each token for accurate mapping.
- Convolutional Search: We slide a window over the tokens to consider all possible substrings of varying lengths.
- Fuzzy Matching: For each substring, we compute a similarity score with the entity using metrics like the token sort ratio from the thefuzz library.
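The following is a minimal, self-contained sketch of that idea, not the cdner conv_match_substring implementation. It slides token windows of increasing size over the sentence and scores each window against the lexeme with thefuzz's token_sort_ratio, normalizing scores from 0-100 to 0-1 to match the outputs shown later.

from thefuzz import fuzz

def best_window_match(sentence, lexeme, max_window=6):
    """Return (start_char, end_char, score) of the best-matching token window."""
    tokens = sentence.split()
    # Record character offsets so a token window maps back to the raw text.
    offsets, cursor = [], 0
    for tok in tokens:
        start = sentence.index(tok, cursor)
        offsets.append((start, start + len(tok)))
        cursor = start + len(tok)

    best = (0, 0, 0.0)
    for size in range(1, min(max_window, len(tokens)) + 1):
        for i in range(len(tokens) - size + 1):
            window = sentence[offsets[i][0]:offsets[i + size - 1][1]]
            score = fuzz.token_sort_ratio(lexeme, window) / 100
            if score > best[2]:
                best = (offsets[i][0], offsets[i + size - 1][1], score)
    return best

sentence = "Name movie titles released in year 1945."
start, end, score = best_window_match(sentence, "movie_release_year")
print(repr(sentence[start:end]), round(score, 2))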
3.3 Annotating the Sentence
We annotate each sentence with:
- Start and End Positions: Indicating the exact location of the entity in the question.
- Label Type: Denoting whether the entity is a table, column, or value.
- Lexeme: The original entity extracted from the SQL query.
- Similarity Score: Reflecting the confidence of the match.
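For illustration, a single annotation for the running example carries exactly these fields (the values here are taken from the annotator output shown in Section 4):

annotation = {
    "start": 5,               # character offset where the match begins
    "end": 17,                # character offset where the match ends (Python slice end)
    "label_type": "column",   # one of: table, column, value
    "lexeme": "movie_title",  # the entity extracted from the SQL query
    "similarity": 0.96,       # confidence of the fuzzy match
}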
3.4 Applying the BIO Tagging
Finally, we convert the annotated entities into a BIO tagging format:
- B-Label: Marks the beginning of an entity.
- I-Label: Marks tokens inside an entity.
- O: Marks tokens outside any entity.
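A minimal sketch of this conversion (our own helper, not the cdner API) projects character-span annotations onto whitespace tokens. Note that BIO allows only one tag per token, so overlapping annotations would first need one of the strategies discussed in Section 5.

def bio_tags(sentence, entities):
    """Project character-span annotations onto whitespace tokens as BIO tags."""
    tokens, offsets, cursor = [], [], 0
    for tok in sentence.split():
        start = sentence.index(tok, cursor)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        cursor = start + len(tok)

    tags = ["O"] * len(tokens)
    for ent in entities:  # each ent: dict with "start", "end", "label_type"
        inside = False
        for i, (s, e) in enumerate(offsets):
            if s >= ent["start"] and e <= ent["end"]:
                tags[i] = ("I-" if inside else "B-") + ent["label_type"]
                inside = True
    return list(zip(tokens, tags))

sentence = "Name movie titles released in year 1945."
entities = [{"start": 5, "end": 17, "label_type": "column"}]
print(bio_tags(sentence, entities))
# -> [('Name', 'O'), ('movie', 'B-column'), ('titles', 'I-column'), ('released', 'O'), ...]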
4 Step-by-Step Example
4.0.1 Extracting Lexemes
Using pglast, we parse the SQL query and extract the following lexemes:
annotator = PglastAnnotator()
annotated_example = annotator.annotate(examples_list[0])
pprint(annotated_example.model_dump())
{'entities': [{'end': 34,
'label_type': 'column',
'lexeme': 'movie_release_year',
'schema_element': None,
'similarity': 0.77,
'start': 5,
'substring': 'movie titles released in year'},
{'end': 101,
'label_type': 'column',
'lexeme': 'movie_popularity',
'schema_element': None,
'similarity': 1.0,
'start': 85,
'substring': 'movie popularity'},
{'end': 17,
'label_type': 'column',
'lexeme': 'movie_title',
'schema_element': None,
'similarity': 0.96,
'start': 5,
'substring': 'movie titles'},
{'end': 10,
'label_type': 'table',
'lexeme': 'movies',
'schema_element': None,
'similarity': 0.91,
'start': 5,
'substring': 'movie'},
{'end': 39,
'label_type': 'value',
'lexeme': '1945',
'schema_element': None,
'similarity': 1.0,
'start': 35,
'substring': '1945'}],
'id': 'bird:train.json:0',
'question': 'Name movie titles released in year 1945. Sort the listing by the '
'descending order of movie popularity.'}
4.0.2 Matching Lexemes
We use convolutional search with fuzzy matching to align lexemes with segments of the sentence (question text). The matching process identifies the most similar substring within a sliding window across the sentence, based on a similarity threshold:
= "Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity."
sentence = [
lexemes 'column', 'movie_release_year'),
(# ('column', 'movie_popularity'),
# ('column', 'movie_title'),
# ('table', 'movies'),
# ('value', '1945')
]print("Sentence =",sentence)
print("lexemes =", lexemes)
# Set a similarity threshold
= 0.8
threshold
# Perform matching
print("Starting the matching process:")
= conv_match_substring(sentence, lexemes, threshold=threshold)
entities
# Display the matched entities
print("\nMatched entities:")
for entity in entities:
print(f"Entity Type: {entity.label_type}")
print(f"Matched Text: '{sentence[entity.start:entity.end]}'")
print(f"Lexeme: {entity.lexeme}")
print(f"Similarity: {entity.similarity}\n")
Sentence = Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity.
lexemes = [('column', 'movie_release_year')]
Starting the matching process:
Matching lexeme 'movie_release_year' of type 'column'
Searching for best match for phrase 'movie_release_year' in sentence.
Window 'Name movie titles released in' (Tokens 0-5): Similarity = 0.64
Window 'movie titles released in year' (Tokens 1-6): Similarity = 0.77
Window 'titles released in year 1945' (Tokens 2-7): Similarity = 0.61
Window 'released in year 1945. Sort' (Tokens 3-8): Similarity = 0.64
Window 'in year 1945. Sort the' (Tokens 4-9): Similarity = 0.46
Window 'year 1945. Sort the listing' (Tokens 5-10): Similarity = 0.41
Window '1945. Sort the listing by' (Tokens 6-11): Similarity = 0.24
Window 'Sort the listing by the' (Tokens 7-12): Similarity = 0.29
Window 'the listing by the descending' (Tokens 8-13): Similarity = 0.3
Window 'listing by the descending order' (Tokens 9-14): Similarity = 0.29
Window 'by the descending order of' (Tokens 10-15): Similarity = 0.32
Window 'the descending order of movie' (Tokens 11-16): Similarity = 0.43
Window 'descending order of movie popularity' (Tokens 12-17): Similarity = 0.41
Window size 5: Best match 'movie titles released in year' with similarity 0.77
Window 'Name movie titles released' (Tokens 0-4): Similarity = 0.68
Window 'movie titles released in' (Tokens 1-5): Similarity = 0.71
Window 'titles released in year' (Tokens 2-6): Similarity = 0.68
Window 'released in year 1945' (Tokens 3-7): Similarity = 0.72
Window 'in year 1945. Sort' (Tokens 4-8): Similarity = 0.46
Window 'year 1945. Sort the' (Tokens 5-9): Similarity = 0.44
Window '1945. Sort the listing' (Tokens 6-10): Similarity = 0.26
Window 'Sort the listing by' (Tokens 7-11): Similarity = 0.27
Window 'the listing by the' (Tokens 8-12): Similarity = 0.33
Window 'listing by the descending' (Tokens 9-13): Similarity = 0.28
Window 'by the descending order' (Tokens 10-14): Similarity = 0.34
Window 'the descending order of' (Tokens 11-15): Similarity = 0.29
Window 'descending order of movie' (Tokens 12-16): Similarity = 0.42
Window 'order of movie popularity' (Tokens 13-17): Similarity = 0.51
Window size 4: Best match 'released in year 1945' with similarity 0.72
Window 'Name movie titles' (Tokens 0-3): Similarity = 0.57
Window 'movie titles released' (Tokens 1-4): Similarity = 0.77
Window 'titles released in' (Tokens 2-5): Similarity = 0.61
Window 'released in year' (Tokens 3-6): Similarity = 0.82
Window 'in year 1945' (Tokens 4-7): Similarity = 0.4
Window 'year 1945. Sort' (Tokens 5-8): Similarity = 0.44
Window '1945. Sort the' (Tokens 6-9): Similarity = 0.26
Window 'Sort the listing' (Tokens 7-10): Similarity = 0.29
Window 'the listing by' (Tokens 8-11): Similarity = 0.31
Window 'listing by the' (Tokens 9-12): Similarity = 0.31
Window 'by the descending' (Tokens 10-13): Similarity = 0.34
Window 'the descending order' (Tokens 11-14): Similarity = 0.32
Window 'descending order of' (Tokens 12-15): Similarity = 0.32
Window 'order of movie' (Tokens 13-16): Similarity = 0.56
Window 'of movie popularity' (Tokens 14-17): Similarity = 0.49
Window size 3: Best match 'released in year' with similarity 0.82
Window 'Name movie' (Tokens 0-2): Similarity = 0.57
Window 'movie titles' (Tokens 1-3): Similarity = 0.6
Window 'titles released' (Tokens 2-4): Similarity = 0.55
Window 'released in' (Tokens 3-5): Similarity = 0.62
Window 'in year' (Tokens 4-6): Similarity = 0.48
Window 'year 1945' (Tokens 5-7): Similarity = 0.37
Window '1945. Sort' (Tokens 6-8): Similarity = 0.22
Window 'Sort the' (Tokens 7-9): Similarity = 0.31
Window 'the listing' (Tokens 8-10): Similarity = 0.28
Window 'listing by' (Tokens 9-11): Similarity = 0.21
Window 'by the' (Tokens 10-12): Similarity = 0.17
Window 'the descending' (Tokens 11-13): Similarity = 0.31
Window 'descending order' (Tokens 12-14): Similarity = 0.35
Window 'order of' (Tokens 13-15): Similarity = 0.38
Window 'of movie' (Tokens 14-16): Similarity = 0.46
Window 'movie popularity' (Tokens 15-17): Similarity = 0.53
Window size 2: Best match 'released in' with similarity 0.62
Window 'Name' (Tokens 0-1): Similarity = 0.18
Window 'movie' (Tokens 1-2): Similarity = 0.43
Window 'titles' (Tokens 2-3): Similarity = 0.33
Window 'released' (Tokens 3-4): Similarity = 0.54
Window 'in' (Tokens 4-5): Similarity = 0.1
Window 'year' (Tokens 5-6): Similarity = 0.36
Window '1945' (Tokens 6-7): Similarity = 0.0
Window 'Sort' (Tokens 7-8): Similarity = 0.18
Window 'the' (Tokens 8-9): Similarity = 0.1
Window 'listing' (Tokens 9-10): Similarity = 0.16
Window 'by' (Tokens 10-11): Similarity = 0.1
Window 'the' (Tokens 11-12): Similarity = 0.1
Window 'descending' (Tokens 12-13): Similarity = 0.21
Window 'order' (Tokens 13-14): Similarity = 0.35
Window 'of' (Tokens 14-15): Similarity = 0.1
Window 'movie' (Tokens 15-16): Similarity = 0.43
Window 'popularity' (Tokens 16-17): Similarity = 0.29
Window size 1: Best match 'released' with similarity 0.54
Best overall match: 'released in year' with similarity 0.82
Matched 'released in year' in sentence with similarity 0.82
Matched entities:
Entity Type: column
Matched Text: 'released in year'
Lexeme: movie_release_year
Similarity: 0.82
5 Challenges
- Alignment Issues: Natural language questions often use varied phrasing that doesn’t directly match the lexemes (e.g., table names, column names) in the database schema.
- Overlapping Entities: When multiple entities are mentioned closely together in a question, their textual representations can overlap.
We can approach these challenges in the following ways:
- Alignment
  - Continuous Annotation: Only continuous (adjacent) substrings in the sentence can be annotated as entities. This means that the words corresponding to an entity must be next to each other without any interruptions.
  - Non-continuous Annotation: Allows for the annotation of entities even if the corresponding words are not adjacent in the sentence. This approach is more flexible and can capture entities that are mentioned in a scattered manner throughout the sentence.
- Overlap
  - Overlap Annotation: Annotations are allowed to overlap in the sentence; that is, a word or phrase can be part of multiple entity annotations. This is useful when different entities share common words in the question.
  - Non-overlap Annotation: Annotations cannot overlap; each word or phrase can be assigned to at most one entity. This constraint ensures that once a word is part of an entity annotation, it cannot be part of another (one way to enforce this is sketched after this list).
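As a sketch of the non-overlap strategy (an assumed helper, not part of cdner), one can greedily keep the highest-similarity annotations and drop any candidate whose span overlaps an already-kept one. The entity dicts below reuse the start/end/similarity fields from Section 3.3, with values taken from a subset of the Section 4 output.

def drop_overlaps(entities):
    """Greedy non-overlap filter: keep annotations by descending similarity."""
    kept = []
    for ent in sorted(entities, key=lambda e: e["similarity"], reverse=True):
        # Keep the candidate only if it does not overlap any kept annotation.
        if all(ent["end"] <= k["start"] or ent["start"] >= k["end"] for k in kept):
            kept.append(ent)
    return kept

entities = [
    {"start": 5, "end": 34, "label_type": "column", "lexeme": "movie_release_year", "similarity": 0.77},
    {"start": 5, "end": 17, "label_type": "column", "lexeme": "movie_title", "similarity": 0.96},
    {"start": 5, "end": 10, "label_type": "table", "lexeme": "movies", "similarity": 0.91},
    {"start": 35, "end": 39, "label_type": "value", "lexeme": "1945", "similarity": 1.0},
]
print([e["lexeme"] for e in drop_overlaps(entities)])
# -> ['1945', 'movie_title']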
5.1 Example
Suppose the following sentence and lexemes:
= "Name movie titles released in year 1945. Sort the listing by the descending order of movie popularity."
sentence = [
lexemes 'column', 'movie_release_year'),
('column', 'movie_popularity'),
('column', 'movie_title'),
('table', 'movies'),
('value', '1945')
( ]
5.2 Continuous Overlapping
Definition: Only continuous substrings in the sentence can be annotated as entities, and annotations are allowed to overlap (i.e., a word or phrase can be part of multiple annotations).
Sentence: “Name [{movie} titles] [released in year] [1945]. Sort the listing by the descending order of [movie popularity].”
- {movie} and [released in year] together match the ‘movie_release_year’ column.
- [movie titles] matches the ‘movie_title’ column.
- [1945] matches the ‘1945’ value.
- [movie popularity] matches the ‘movie_popularity’ column.
5.3 Continuous Non-Overlapping
Definition: Only continuous substrings in the sentence can be annotated as entities, and annotations cannot overlap (i.e., each word or phrase can be part of at most one annotation).
Sentence: “Name [movie titles] [released in year] [1945]. Sort the listing by the descending order of [movie popularity].”
- [movie titles] matches the ‘movie_title’ column.
- [released in year] matches the ‘movie_release_year’ column.
- [1945] matches the ‘1945’ value.
- [movie popularity] matches the ‘movie_popularity’ column.
5.4 Non-Continuous Overlapping
Definition: Substrings can be non-continuous (i.e., words corresponding to an entity do not need to be adjacent), and annotations are allowed to overlap.
Sentence: “Name [{movie} titles] [released] in [year] [1945]. Sort the listing by the descending order of [{movie} popularity].”
- {movie} matches the ‘movies’ table.
- [movie titles] separately matches the ‘movie_title’ column.
- {movie}, [released], and [year] together correspond to the ‘movie_release_year’ column.
- [1945] matches the ‘1945’ value.
- [movie popularity] corresponds to the ‘movie_popularity’ column.
5.5 Non-Continuous Non-Overlapping
Definition: Substrings can be non-continuous, and annotations cannot overlap.
Sentence: “Name [movie] [titles] [released] in [year] [1945]. Sort the listing by the descending order of [movie] [popularity].”
- [movie] matches the ‘movies’ table or the ‘movie_title’ column.
- [titles] matches the ‘movie_title’ column.
- [released] matches the ‘movie_release_year’ column.
- [year] matches the ‘movie_release_year’ column.
- [1945] matches the ‘1945’ value.
- [movie] and [popularity] match the ‘movie_popularity’ column.